What is claimed is: 



1. A method of validating a byte sequence, the method comprising:
defining a plurality of states for the byte sequence; 

designating one or more noise states from among the plurality of states; 
generating a most probable state sequence for the byte sequence; 
utilizing said state sequence to identify all noise in the byte sequence; and 
localizing said noise in said noise states. 

2. The method of claim 1 further comprising deleting said noise from the 
byte sequence. 
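As an illustration only, the deletion step of claim 2 can be sketched in Python once a state label has been assigned to each byte. The function name `delete_noise` and the state label `"N"` are hypothetical and do not appear in the claims.

```python
def delete_noise(data: bytes, states: list, noise_states: set) -> bytes:
    """Drop every byte whose assigned state label is a noise state."""
    # data[i] carries the label states[i]; both sequences have equal length.
    return bytes(b for b, s in zip(data, states) if s not in noise_states)


# Hypothetical labels: "A" for an ordinary state, "N" for a noise state.
cleaned = delete_noise(b"ab\xffc", ["A", "A", "N", "A"], {"N"})
```

Here `cleaned` holds only the three bytes that were not labeled as noise.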

3. The method of claim 1 wherein an ASCII state is also designated as a noise
state.

4. The method of claim 1 wherein said generating of a most probable state
sequence comprises calculating $P(X_0 \ldots X_n \mid S_0 \ldots S_n)$, representing the conditional
probabilities of said byte sequence given a state sequence.

5. The method of claim 4 wherein said calculating $P(X_0 \ldots X_n \mid S_0 \ldots S_n)$
comprises assigning a state label $S_i$ to each $i$-th byte $X_i$ of the byte sequence so as to
maximize the equation:

$$P(X_0 \ldots X_N \mid S_0 \ldots S_N) = P_0(S_0) \prod_{i=1}^{N} A(S_i \mid S_{i-1}) \prod_{i=0}^{N} B(X_i \mid S_i)$$

wherein $P_0(S_0)$ is the initial distribution of states; $A(S_i \mid S_{i-1})$ is a "state-to-state"
transmission matrix; and $B(X_i \mid S_i)$ is a "byte-from-state" matrix of the probabilities of
generating a byte value $X_i$ given a state $S_i$.
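One common way to find the state labels that maximize a product of this form is Viterbi dynamic programming in log space. The claims do not name a specific algorithm, so the following is only a sketch under that assumption. The two-state toy model (`"T"` for text, `"N"` for noise) and all of its probabilities are hypothetical values chosen for illustration. The table `A[prev][next]` plays the role of $A(S_i \mid S_{i-1})$ and `B[state][byte]` the role of $B(X_i \mid S_i)$.

```python
import math


def viterbi(data, states, p0, A, B):
    """Return labels S_0..S_N maximizing
    p0[S_0] * prod A[S_{i-1}][S_i] * prod B[S_i][X_i]."""
    def log(p):
        # Treat log(0) as -infinity so impossible paths never win.
        return math.log(p) if p > 0 else float("-inf")

    if not data:
        return []
    # score[s] = best log-probability of any labeling that ends in state s
    score = {s: log(p0[s]) + log(B[s][data[0]]) for s in states}
    back = []  # back[i][s] = best predecessor of s at step i + 1
    for x in data[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log(A[r][s]))
            score[s] = prev[best] + log(A[best][s]) + log(B[s][x])
            pointers[s] = best
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: "T" emits only bytes below 128, "N" emits any byte.
STATES = ("T", "N")
P0 = {"T": 0.9, "N": 0.1}
A = {"T": {"T": 0.9, "N": 0.1}, "N": {"T": 0.5, "N": 0.5}}
B = {"T": [1 / 128 if x < 128 else 0.0 for x in range(256)],
     "N": [1 / 256] * 256}

labels = viterbi(b"ab\xffc", STATES, P0, A, B)
```

In this toy model the byte `0xFF` has zero probability under `"T"`, so the most probable labeling must assign it to `"N"`.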



6. The method of claim 5, wherein:



where each $p(S_{i-1} \to S_i)$ is the probability that a particular $S_i$ state immediately follows an
$S_{i-1}$ state in a valid byte sequence having $\alpha$ states.



7. The method of claim 6, wherein:

$$
A(S_i \mid S_{i-1}) =
\begin{bmatrix}
p(A \to A) & p(A \to GB1) & p(A \to GB2) \\
p(GB1 \to A) & p(GB1 \to GB1) & p(GB1 \to GB2) \\
p(GB2 \to A) & p(GB2 \to GB1) & p(GB2 \to GB2)
\end{bmatrix}
$$

where each $p(S_{i-1} \to S_i)$ is the probability that a particular $S_i$ state immediately follows an
$S_{i-1}$ state in a valid byte sequence having three states.



8. The method of claim 7, wherein:

$$
A(S_i \mid S_{i-1}) =
\begin{bmatrix}
0.995157 & 0.004843 & 0 \\
0 & 0 & 1 \\
0.037944 & 0.962056 & 0
\end{bmatrix}
$$

and said valid byte sequence is valid text in the GB 2312-80 character set.
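As a quick consistency check, the numeric values recited above can be loaded into a table and each row verified to sum to 1. This sketch assumes that rows correspond to the preceding state and columns to the following state, both in the order A, GB1, GB2. The names `STATES`, `ROWS` and `rows_are_stochastic` are hypothetical.

```python
# Assumed ordering: rows = preceding state, columns = following state.
STATES = ("A", "GB1", "GB2")
ROWS = (
    (0.995157, 0.004843, 0.0),   # from A
    (0.0,      0.0,      1.0),   # from GB1: a first byte is always followed by a second byte
    (0.037944, 0.962056, 0.0),   # from GB2
)
A = {prev: dict(zip(STATES, row)) for prev, row in zip(STATES, ROWS)}


def rows_are_stochastic(matrix, tol=1e-9):
    """True if every row of transition probabilities sums to 1 within tol."""
    return all(abs(sum(row.values()) - 1.0) <= tol for row in matrix.values())


ok = rows_are_stochastic(A)
```

Under this reading, `A["GB1"]["GB2"]` is 1 because the first byte of a two-byte character must be followed by its second byte.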



9. The method of claim 5, wherein: 



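$$
B(X_i \mid S_i) =
\begin{bmatrix}
\varepsilon_1(X_i = x_1) & h_1(X_i = x_1) & \cdots & h_\alpha(X_i = x_1) \\
\varepsilon_1(X_i = x_1 + 1) & h_1(X_i = x_1 + 1) & \cdots & h_\alpha(X_i = x_1 + 1) \\
\vdots & \vdots & & \vdots \\
\varepsilon_r(X_i = x_r) & h_1(X_i = x_r) & \cdots & h_\alpha(X_i = x_r) \\
\varepsilon_r(X_i = x_r + 1) & h_1(X_i = x_r + 1) & \cdots & h_\alpha(X_i = x_r + 1) \\
\vdots & \vdots & & \vdots \\
\varepsilon_r(X_i = 255) & h_1(X_i = 255) & \cdots & h_\alpha(X_i = 255)
\end{bmatrix}
$$

where $h_s(X_i)$ are histogram functions of the $\alpha$ states and $\varepsilon_j(X_i)$ are probabilities of
associating noise with the noise state for bytes within $r+1$ ranges of byte values $X_i$.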
10. The method of claim 9, wherein:

$$
B(X_i \mid S_i) =
\begin{bmatrix}
h_A(X_i = 1) & 0 & 0 \\
\vdots & \vdots & \vdots \\
\varepsilon_1(X_i = 128) & 0 & 0 \\
\vdots & \vdots & \vdots \\
\varepsilon_1(X_i = 160) & 0 & 0 \\
\varepsilon_2(X_i = 161) & h_1(X_i = 161) & h_2(X_i = 161) \\
\vdots & \vdots & \vdots \\
\varepsilon_2(X_i = 254) & h_1(X_i = 254) & h_2(X_i = 254) \\
\varepsilon_3(X_i = 255) & 0 & 0
\end{bmatrix}
$$

where $h_s(X_i)$ are histogram functions of the states, and $\varepsilon_j(X_i)$ are probabilities of
associating noise with the ASCII state within a plurality of $X_i$ ranges for a three-state
byte sequence.
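A three-state "byte-from-state" table of this kind might be assembled as in the sketch below. Several simplifications are assumptions rather than claim language: the ASCII histogram is taken to cover all byte values below 128, each noise probability is treated as a constant over its byte range (128 to 160, 161 to 254, and 255), and the names `build_emission_table` and `eps` are hypothetical.

```python
def build_emission_table(h_a, h_1, h_2, eps):
    """Return B[state][byte] for the states "A", "GB1" and "GB2".

    h_a, h_1, h_2: sequences of 256 histogram values indexed by byte value.
    eps: three noise probabilities, one per byte range (assumed constant).
    """
    e1, e2, e3 = eps
    B = {"A": [0.0] * 256, "GB1": [0.0] * 256, "GB2": [0.0] * 256}
    for x in range(256):
        if x < 128:            # ASCII range: only the ASCII state emits
            B["A"][x] = h_a[x]
        elif x <= 160:         # 128..160: noise absorbed by the ASCII state
            B["A"][x] = e1
        elif x <= 254:         # 161..254: two-byte range plus residual noise
            B["A"][x] = e2
            B["GB1"][x] = h_1[x]
            B["GB2"][x] = h_2[x]
        else:                  # 255: noise absorbed by the ASCII state
            B["A"][x] = e3
    return B


# Placeholder uniform histograms and noise values, for illustration only.
uniform = [1 / 256] * 256
table = build_emission_table(uniform, uniform, uniform, (1e-6, 1e-7, 1e-6))
```

Because the two-byte states have zero probability outside 161 to 254, any byte outside that range can only be explained by the ASCII state, which is how noise becomes localized there.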



11. The method of claim 10 further comprising:

providing a valid three-state byte sequence having an ASCII state and
comprising valid ASCII and two-byte characters;

computing an ASCII histogram $h_A(X_i)$ by a method comprising:
sampling valid ASCII text so as to measure the frequency of
occurrence of each byte value;

computing an unnormalized ASCII histogram of said sampling
over the ASCII range of $X_i$; and


YOR9-2001-0229 (8728-506) 






normalizing said unnormalized ASCII histogram such that the
entire column of the matrix containing said ASCII histogram sums to 1;
computing a first-byte histogram $h_1(X_i)$ by sampling valid two-byte text
and computing the unnormalized first-byte histogram over the odd bytes, and
normalizing said first-byte histogram such that the entire column of the matrix containing
said first-byte histogram sums to 1; and

computing a second-byte histogram $h_2(X_i)$ by sampling valid two-byte text
and computing the unnormalized second-byte histogram over the even bytes, and
normalizing said second-byte histogram such that the entire column of the matrix
containing said second-byte histogram sums to 1.
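The histogram steps of this claim can be sketched as follows. The sketch assumes that the sample of two-byte text contains only two-byte characters, so the first byte of each character sits at the 1st, 3rd, 5th, ... position and the second byte at the positions in between. The `mass` parameter is a hypothetical way to leave room in a column for noise probabilities, so that the whole column can still sum to 1. The function names are also hypothetical.

```python
from collections import Counter


def normalized_histogram(byte_values, mass=1.0):
    """Count byte values and scale the counts so that they sum to `mass`."""
    counts = Counter(byte_values)
    total = sum(counts.values())
    hist = [0.0] * 256
    if total == 0:
        return hist
    for value, count in counts.items():
        hist[value] = mass * count / total
    return hist


def two_byte_histograms(text: bytes):
    """Return (first-byte histogram, second-byte histogram) for two-byte text."""
    first = normalized_histogram(text[0::2])   # 1st, 3rd, 5th, ... bytes
    second = normalized_histogram(text[1::2])  # 2nd, 4th, 6th, ... bytes
    return first, second


h_a = normalized_histogram(b"aab")
h_1, h_2 = two_byte_histograms(bytes([0xB0, 0xA1, 0xB0, 0xA2]))
```

With real data, the samples would be large bodies of valid ASCII text and valid two-byte text rather than these short placeholders.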

12. A program storage device readable by machine, tangibly embodying a
program of instructions executable by the machine to perform method steps for 
validating a byte sequence, said method comprising: 

defining a plurality of states for the byte sequence; 
designating one or more noise states from among the plurality of states;

generating a most probable state sequence for the byte sequence; 
utilizing said state sequence to identify all noise in the byte sequence; and 
localizing said noise in said noise states. 

13. The device of claim 12 wherein said localizing of said noise in said noise
states comprises:
examining each byte in said byte sequence that does not correspond to a
noise state;

determining if the byte is valid; and

if the byte is not valid, then redesignating the state of said byte to a noise
state.



14. The device of claim 13 further comprising: 
a lookup table of valid bytes; and 

wherein said determination if a byte is valid is accomplished by accessing 
said lookup table. 
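The examination and redesignation steps of claims 13 and 14 might look like the sketch below. The layout of the lookup table (one list of 256 booleans per non-noise state), the state labels and the function name `relabel_invalid` are all assumptions made for illustration.

```python
def relabel_invalid(data, states, valid_table, noise_state="N"):
    """Redesignate to the noise state every non-noise byte that the
    lookup table marks as invalid for its assigned state."""
    out = list(states)
    for i, (byte, state) in enumerate(zip(data, states)):
        # Bytes already in the noise state are not examined.
        if state != noise_state and not valid_table[state][byte]:
            out[i] = noise_state
    return out


# Hypothetical table: in state "A" only byte values below 128 are valid.
VALID = {"A": [x < 128 for x in range(256)]}
relabeled = relabel_invalid(b"a\x90b", ["A", "A", "A"], VALID)
```

A table lookup keeps the validity check at constant cost per byte, which suits a pass over every non-noise byte in the sequence.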



15. A method of validating a byte sequence, the method comprising:

defining a plurality of states for the byte sequence, including at least one
ASCII state;

designating at least one ASCII state as the noise state;
generating a most probable state sequence for the byte sequence by a
method comprising:

calculating $P(X_0 \ldots X_n \mid S_0 \ldots S_n)$, representing the conditional
probabilities of said byte sequence given a state sequence;

wherein said calculating $P(X_0 \ldots X_n \mid S_0 \ldots S_n)$ comprises
assigning a state label $S_i$ to each $i$-th byte $X_i$ of the byte sequence so as to
maximize the equation:
$$P(X_0 \ldots X_N \mid S_0 \ldots S_N) = P_0(S_0) \prod_{i=1}^{N} A(S_i \mid S_{i-1}) \prod_{i=0}^{N} B(X_i \mid S_i)$$

wherein $P_0(S_0)$ is the initial distribution of states; $A(S_i \mid S_{i-1})$ is a
"state-to-state" transmission matrix; and $B(X_i \mid S_i)$ is a "byte-from-state"
matrix of the probabilities of generating a byte value $X_i$ given a state $S_i$;
utilizing said state sequence to identify all noise in the byte sequence; 
localizing said noise in said noise states; and 
deleting said noise from the byte sequence. 